Group No.: 17
Member: ZHU Lei (EID: lzhu68, SID: 55883618)
In Section 1, exploratory data analysis is provided. Specifically, several sample images are shown, and the per-channel intensity distribution and label distribution are visualized with Plotly.
In Section 2, miscellaneous utility functions are defined.
In Section 3, several solutions are tried, specifically:
In Section 3.1, bag-of-words (BoW) features and two machine learning classifiers (logistic regression and RBF-kernel SVM) are used. Two kinds of local descriptor, SIFT and ORB, are used for visual word extraction; visual words are derived by KMeans clustering, and different vocabulary sizes are tried. The best validation accuracy I got in this part is 0.657330.
In Section 3.2, I use 3 CNN architectures (MobileNetV2, InceptionResNetV2, VGG16) to extract deep features, and apply dimension reduction (Kernel PCA or NMF) followed by a classifier (one of linear SVM, RBF SVM, LR) to do classification. The best combination (MobileNetV2 features + no dimension reduction + LR) gives validation accuracy 0.862716.
In Section 3.3, I finetuned 4 CNN architectures (ResNet101V2, MobileNetV2, InceptionResNetV2, VGG16) end-to-end. The best one (InceptionResNetV2) gives validation accuracy 0.974453.
In Section 3.4, I retried finetuning the CNN architectures mentioned above after the background of the input images is removed with the GrabCut algorithm. The best validation score I derived in this section is
In Section 4, several of the best results are ensembled to produce the final submission.
# !pip install -q efficientnet
# !pip install opencv-python==3.4.2.17
# !pip install opencv-contrib-python==3.4.2.17
# !conda install tensorflow-gpu -y
# !conda install keras=2.3.1 -y
# !conda install pandas -y
# !conda install tqdm -y
# !conda install scikit-learn -y
# !conda install plotly -y
import os
import gc
import re
import cv2
import math
import numpy as np
import scipy as sp
import pandas as pd
import tensorflow as tf
from IPython.display import SVG
import efficientnet.tfkeras as efn
from keras.utils import plot_model
import tensorflow.keras.layers as L
from keras.utils import model_to_dot
import tensorflow.keras.backend as K
from tensorflow.keras.models import Model
# from kaggle_datasets import KaggleDatasets
from tensorflow.keras.applications import InceptionResNetV2
# import seaborn as sns
from tqdm import tqdm
import matplotlib.cm as cm
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
tqdm.pandas()
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from collections import OrderedDict
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD, NMF
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn import cluster
from joblib import parallel_backend, Parallel, delayed
# import warnings
# warnings.filterwarnings("ignore")
EPOCHS = 20
SAMPLE_LEN = 100
IMAGE_PATH = "../input/plant-pathology-2020-fgvc7/images/"
TEST_PATH = "../input/plant-pathology-2020-fgvc7/test.csv"
TRAIN_PATH = "../input/plant-pathology-2020-fgvc7/train.csv"
SUB_PATH = "../input/plant-pathology-2020-fgvc7/sample_submission.csv"
sub = pd.read_csv(SUB_PATH)
test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)
train_data.head()
test_data.head()
def load_image(image_id):
file_path = image_id + ".jpg"
image = cv2.imread(IMAGE_PATH + file_path)
return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
train_images = train_data["image_id"][:SAMPLE_LEN].progress_apply(load_image)
fig = px.imshow(cv2.resize(train_images[0], (205, 136)))
fig.show()
I have plotted the first image in the training data above (the RGB values can be seen by hovering over the image). The green parts of the image have very low blue values, while the brown parts have high blue values. This suggests that healthy (green) regions of a leaf have low blue values whereas unhealthy regions tend to have high blue values, so the blue channel may be a key signal for detecting diseases in plants.
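This blue-channel hypothesis could be checked numerically rather than by eye. A minimal sketch, using synthetic arrays as stand-ins for the real images (in the notebook, the two groups would be obtained by indexing train_images with the "healthy" column of train_data):

```python
import numpy as np

# Synthetic stand-ins for healthy vs. diseased leaf images (H x W x RGB).
# The value ranges are made up so that diseased images have higher blue.
rng = np.random.default_rng(0)
healthy_imgs = rng.integers(0, 80, size=(5, 136, 205, 3))
diseased_imgs = rng.integers(100, 200, size=(5, 136, 205, 3))

# Mean blue value (channel index 2 in RGB order) per group.
mean_blue_healthy = np.mean([img[:, :, 2].mean() for img in healthy_imgs])
mean_blue_diseased = np.mean([img[:, :, 2].mean() for img in diseased_imgs])
print(mean_blue_healthy, mean_blue_diseased)
```

If the hypothesis holds on the real data, the diseased group should show a clearly higher mean blue value.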
red_values = [np.mean(train_images[idx][:, :, 0]) for idx in range(len(train_images))]
green_values = [np.mean(train_images[idx][:, :, 1]) for idx in range(len(train_images))]
blue_values = [np.mean(train_images[idx][:, :, 2]) for idx in range(len(train_images))]
values = [np.mean(train_images[idx]) for idx in range(len(train_images))]
fig = ff.create_distplot([values], group_labels=["Channels"], colors=["purple"])
fig.update_layout(showlegend=False, template="simple_white")
fig.update_layout(title_text="Distribution of channel values")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig
The channel values seem to follow a roughly normal distribution centered around 105. Since the maximum channel value is 255, the average channel value is less than half the maximum, which indicates that the channels are only moderately activated most of the time.
fig = ff.create_distplot([red_values], group_labels=["R"], colors=["red"])
fig.update_layout(showlegend=False, template="simple_white")
fig.update_layout(title_text="Distribution of red channel values")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig
The red channel values seem to follow a roughly normal distribution, but with a slight rightward (positive) skew. This indicates that the red channel tends to be concentrated at lower values, around 100. There is large variation in average red values across images.
fig = ff.create_distplot([green_values], group_labels=["G"], colors=["green"])
fig.update_layout(showlegend=False, template="simple_white")
fig.update_layout(title_text="Distribution of green channel values")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig
The green channel values have a more uniform distribution than the red channel values, with a smaller peak. The distribution also has a leftward skew (in contrast to red) and a larger mode of around 140. This indicates that green is more pronounced in these images than red, which makes sense, because these are images of leaves!
fig = ff.create_distplot([blue_values], group_labels=["B"], colors=["blue"])
fig.update_layout(showlegend=False, template="simple_white")
fig.update_layout(title_text="Distribution of blue channel values")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig
The blue channel has the most uniform distribution of the three color channels, with minimal skew (slightly leftward). The blue channel also shows great variation across images in the dataset.
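The skews read off these plots by eye can be quantified. A small sketch of the Fisher-Pearson skewness coefficient in plain NumPy, applied to toy values standing in for red_values / green_values / blue_values:

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson coefficient of skewness: E[(x - mu)^3] / sigma^3."""
    x = np.asarray(x, dtype=np.float64)
    mu, sigma = x.mean(), x.std()
    return np.mean((x - mu) ** 3) / sigma ** 3

# Toy values with a long right tail, standing in for per-image channel means.
right_skewed = np.array([90, 95, 100, 100, 105, 110, 150, 180])
print(sample_skewness(right_skewed))  # positive => rightward skew
```

A positive coefficient corresponds to the "rightward (positive) skew" described for the red channel; a negative one to the leftward skew of green and blue.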
fig = go.Figure()
for idx, values in enumerate([red_values, green_values, blue_values]):
if idx == 0:
color = "Red"
if idx == 1:
color = "Green"
if idx == 2:
color = "Blue"
fig.add_trace(go.Box(x=[color]*len(values), y=values, name=color, marker=dict(color=color.lower())))
fig.update_layout(yaxis_title="Mean value", xaxis_title="Color channel",
title="Mean value vs. Color channel", template="plotly_white")
fig = ff.create_distplot([red_values, green_values, blue_values],
group_labels=["R", "G", "B"],
colors=["red", "green", "blue"])
fig.update_layout(title_text="Distribution of red, green and blue channel values", template="simple_white")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig.data[1].marker.line.color = 'rgb(0, 0, 0)'
fig.data[1].marker.line.width = 0.5
fig.data[2].marker.line.color = 'rgb(0, 0, 0)'
fig.data[2].marker.line.width = 0.5
fig
From the above plots, we can clearly see which colors are more common and which ones less common in the leaf images. Green is the most pronounced color, followed by red and blue respectively. The distributions, when plotted together, appear to have a similar shape, but shifted horizontally.
Now, I will visualize sample leaves belonging to different categories in the dataset.
def visualize_leaves(cond=[0, 0, 0, 0], cond_cols=["healthy"], is_cond=True):
if not is_cond:
cols, rows = 3, min([3, len(train_images)//3])
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(30, rows*20/3))
for col in range(cols):
for row in range(rows):
ax[row, col].imshow(train_images.loc[train_images.index[-row*3-col-1]])
return None
cond_0 = "healthy == {}".format(cond[0])
cond_1 = "scab == {}".format(cond[1])
cond_2 = "rust == {}".format(cond[2])
cond_3 = "multiple_diseases == {}".format(cond[3])
cond_list = []
for col in cond_cols:
if col == "healthy":
cond_list.append(cond_0)
if col == "scab":
cond_list.append(cond_1)
if col == "rust":
cond_list.append(cond_2)
if col == "multiple_diseases":
cond_list.append(cond_3)
data = train_data[:100]
# print(len(data))
for cond in cond_list:
data = data.query(cond)
# print(list(data.index))
images = train_images.loc[list(data.index)]
cols, rows = 3, min([3, len(images)//3])
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(30, rows*20/3))
for col in range(cols):
for row in range(rows):
ax[row, col].imshow(images.loc[images.index[row*3+col]])
plt.show()
visualize_leaves(cond=[1, 0, 0, 0], cond_cols=["healthy"])
In the above images, we can see that the healthy leaves are completely green and do not have any brown/yellow spots or scars. Healthy leaves show neither scab nor rust.
visualize_leaves(cond=[0, 1, 0, 0], cond_cols=["scab"])
In the above images, we can see that leaves with "scab" have large brown marks and stains across the leaf. Scab is defined as "any of various plant diseases caused by fungi or bacteria and resulting in crustlike spots on fruit, leaves, or roots. The spots caused by such a disease". The brown marks across the leaf are a sign of these bacterial/fungal infections. Once diagnosed, scab can be treated using chemical or non-chemical methods.
visualize_leaves(cond=[0, 0, 1, 0], cond_cols=["rust"])
In the above images, we can see that leaves with "rust" have several brownish-yellow spots across the leaf. Rust is defined as "a disease, especially of cereals and other grasses, characterized by rust-colored pustules of spores on the affected leaf blades and sheaths and caused by any of several rust fungi". The yellow spots are a sign of infection by a special type of fungi called "rust fungi". Rust can also be treated with several chemical and non-chemical methods once diagnosed.
visualize_leaves(cond=[0, 0, 0, 1], cond_cols=["multiple_diseases"])
In the above images, we can see that the leaves show symptoms for several diseases, including brown marks and yellow spots. These plants have more than one of the above-described diseases.
Now, I will visualize the label distribution of the training data using a pie chart.
fig = go.Figure([go.Pie(labels=train_data.columns[1:],
values=train_data.iloc[:, 1:].sum().values)])
fig.update_layout(title_text="Pie chart of targets", template="simple_white")
fig.data[0].marker.line.color = 'rgb(0, 0, 0)'
fig.data[0].marker.line.width = 0.5
fig.show()
In the pie chart above, we can see that most leaves in the dataset are unhealthy (71.7%). Only 5% of plants have multiple diseases, and "rust" and "scab" each occupy approximately one-third of the pie. In short: the classes are imbalanced, so we may need to apply balanced class weights or a resampling strategy to handle this.
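One standard balancing heuristic is sklearn's class_weight='balanced' formula, n_samples / (n_classes * count). A sketch of it in plain NumPy; the class counts below are illustrative approximations of the pie chart above, not values computed in this notebook:

```python
import numpy as np

# Inverse-frequency class weights, mirroring sklearn's
# class_weight='balanced' heuristic: n_samples / (n_classes * count).
# Counts are illustrative, ordered (healthy, multiple_diseases, rust, scab).
counts = np.array([516, 91, 622, 592])
n_samples, n_classes = counts.sum(), len(counts)
weights = n_samples / (n_classes * counts)
print(dict(zip(['healthy', 'multiple_diseases', 'rust', 'scab'],
               np.round(weights, 2))))
```

The rarest class (multiple_diseases) receives the largest weight, so misclassifying it costs the model proportionally more.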
# TPU or GPU detection
# Detect hardware, return appropriate distribution strategy
try:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', tpu.master())
except ValueError:
tpu = None
if tpu:
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
strategy = tf.distribute.get_strategy()
def seed_everything(seed=0):
np.random.seed(seed)
tf.random.set_seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
SEED=2048
seed_everything(SEED)
print("REPLICAS: ", strategy.num_replicas_in_sync)
print("GPUs: {}".format(tf.config.experimental.list_physical_devices('GPU')))
# # Data access
# GCS_DS_PATH = KaggleDatasets().get_gcs_path()
# Configuration
AUTO = tf.data.experimental.AUTOTUNE
EPOCHS = 40
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
VALIDATION_SIZE = 0.15
IMAGE_SIZE = 400
# multiprocessing
N_JOBS=-1
def format_path(st):
return IMAGE_PATH + st + '.jpg'
test_paths = test_data.image_id.apply(format_path).values
trainval_paths = train_data.image_id.apply(format_path).values
trainval_labels = np.float32(train_data.loc[:, 'healthy':'scab'].values)
train_paths, valid_paths, train_labels, valid_labels =\
train_test_split(trainval_paths, trainval_labels, test_size=VALIDATION_SIZE,
random_state=SEED)
print('train samples: ', len(train_paths))
print('valid samples: ', len(valid_paths))
print('test samples: ', len(test_paths))
print('path example: ', train_paths[0])
print('label example: ', train_labels[0])
def decode_image(filename, label=None, image_size=(IMAGE_SIZE, IMAGE_SIZE)):
bits = tf.io.read_file(filename)
image = tf.image.decode_jpeg(bits, channels=3)
# image = tf.cast(image, tf.float32) / 255.0
# https://www.tensorflow.org/tutorials/images/transfer_learning
# https://github.com/keras-team/keras-applications/blob/bc89834ed36935ab4a4994446e34ff81c0d8e1b7/keras_applications/imagenet_utils.py#L42
image = tf.cast(image, tf.float32)
image = (image/127.5) - 1
image = tf.image.resize(image, image_size)
if label is None:
return image
else:
return image, label
def data_augment(image, label=None):
image = tf.image.random_flip_left_right(image)
image = tf.image.random_flip_up_down(image)
if label is None:
return image
else:
return image, label
def display_training_curves(training, validation, title, subplot):
"""
Source: https://www.kaggle.com/mgornergoogle/getting-started-with-100-flowers-on-tpu
"""
if subplot%10==1: # set up the subplots on the first call
plt.subplots(figsize=(10,10), facecolor='#F0F0F0')
plt.tight_layout()
ax = plt.subplot(subplot)
ax.set_facecolor('#F8F8F8')
ax.plot(training)
ax.plot(validation)
ax.set_title('model '+ title)
ax.set_ylabel(title)
#ax.set_ylim(0.28,1.05)
ax.set_xlabel('epoch')
ax.legend(['train', 'valid.'])
def get_backbone(cnn='VGG16'):
assert cnn in ['ResNet101V2', 'VGG16', 'InceptionResNetV2', 'MobileNetV2']
if cnn == 'ResNet101V2':
backbone = tf.keras.applications.ResNet101V2(
input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
weights='imagenet',
include_top=False)
if cnn == 'VGG16':
backbone = tf.keras.applications.VGG16(
input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
weights='imagenet',
include_top=False)
if cnn == 'InceptionResNetV2':
backbone = tf.keras.applications.InceptionResNetV2(
input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
weights='imagenet',
include_top=False)
if cnn == 'MobileNetV2':
backbone = tf.keras.applications.MobileNetV2(
input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
weights='imagenet',
include_top=False)
return backbone
def get_classifier(name='linearSVM'):
if name == 'linearSVM':
# return LinearSVC(class_weight='balanced',
# probability=True)
return SVC(kernel='linear',
class_weight='balanced',
probability=True)
if name == 'rbfSVM':
return SVC(kernel='rbf',
class_weight='balanced',
probability=True)
if name == 'LR':
return LogisticRegression()
def get_dim_reductor(name='PCA_128'):
method, n_components = name.split('_')
n_components = int(n_components)
# print(method, n_components)
if method == 'PCA':
return PCA(n_components=n_components)
# return KernelPCA(n_components=n_components)
if method == 'KPCA':
return KernelPCA(kernel='rbf', n_components=n_components)
if method == 'LSA':
return TruncatedSVD(n_components=n_components,
random_state=SEED)
if method == 'NMF':
return NMF(n_components=n_components)
trainval_dataset = (
tf.data.Dataset
.from_tensor_slices((trainval_paths, trainval_labels))
.map(decode_image, num_parallel_calls=AUTO)
.cache()
.map(data_augment, num_parallel_calls=AUTO)
.shuffle(512)
.batch(BATCH_SIZE)
.prefetch(AUTO)
)
train_dataset = (
tf.data.Dataset
.from_tensor_slices((train_paths, train_labels))
.map(decode_image, num_parallel_calls=AUTO)
.cache()
.map(data_augment, num_parallel_calls=AUTO)
.repeat()
.shuffle(512)
.batch(BATCH_SIZE)
.prefetch(AUTO)
)
train_dataset_1 = (
tf.data.Dataset
.from_tensor_slices((train_paths, train_labels))
.map(decode_image, num_parallel_calls=AUTO)
.cache()
.map(data_augment, num_parallel_calls=AUTO)
.repeat()
.shuffle(512)
.batch(64)
.prefetch(AUTO)
)
valid_dataset = (
tf.data.Dataset
.from_tensor_slices((valid_paths, valid_labels))
.map(decode_image, num_parallel_calls=AUTO)
.batch(BATCH_SIZE)
.cache()
.prefetch(AUTO)
)
test_dataset = (
tf.data.Dataset
.from_tensor_slices(test_paths)
.map(decode_image, num_parallel_calls=AUTO)
.map(data_augment, num_parallel_calls=AUTO)
.batch(BATCH_SIZE)
)
ckpt_dir = '../output/best_models'
submission_dir = '../output/submissions'
os.makedirs(ckpt_dir, exist_ok=True)
os.makedirs(submission_dir, exist_ok=True)
class BoW(object):
def __init__(self, local_feature='SIFT', vsize=3):
if local_feature == 'SIFT':
self.local_feature_extractor = cv2.xfeatures2d.SIFT_create()
if local_feature == 'SURF':
self.local_feature_extractor = cv2.xfeatures2d.SURF_create()
if local_feature == 'ORB':
self.local_feature_extractor = cv2.ORB_create()
self.vsize = vsize
self.kmeans = None
def fit(self, im_paths):
# des_mat_list = [ self.get_local_feature_by_path(im_path)[1] for
# im_path in tqdm(im_paths, desc='Extrcting feature points') ]
des_mat_list = Parallel(n_jobs=N_JOBS, backend='threading')\
(delayed(self.get_local_feature_by_path)(im_path)
            for im_path in tqdm(im_paths, desc='Extracting feature points')
)
des_mat_all = np.concatenate(des_mat_list, axis=0)
# print(f'{len(des_mat_all):d} key points have been extracted!')
# print('fitting kmeans to get codebook')
self.kmeans = cluster.MiniBatchKMeans(n_clusters=self.vsize,
init_size=10*self.vsize,
batch_size=self.vsize,
random_state=SEED)
self.kmeans.fit(des_mat_all)
# print('finish building codebook')
def transform(self, im_paths):
# print('building training feature matrix...')
# bow_matrix = [ self.get_bow_feature_vec(im_path) for \
# im_path in tqdm(im_paths, desc='Extracting BoW feature') ]
bow_matrix = Parallel(n_jobs=N_JOBS, backend='threading')\
(delayed(self.get_bow_feature_vec)(im_path)
for im_path in tqdm(im_paths, desc='Extracting BoW feature')
)
bow_matrix = np.stack(bow_matrix)
return bow_matrix
def fit_transform(self, im_paths):
self.fit(im_paths)
return self.transform(im_paths)
def get_local_feature_by_path(self, im_path, ret_kp=False):
img = cv2.imread(im_path)
im_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kp, des = self.local_feature_extractor.detectAndCompute(im_gray, None)
if ret_kp:
return kp, des
return des
def get_bow_feature_vec(self, im_path):
des_mat = self.get_local_feature_by_path(im_path)
word_idx = self.kmeans.predict(des_mat)
hist = np.bincount(word_idx.ravel(), minlength=self.vsize)
return hist
trainvalY = np.argmax(trainval_labels, 1)
record_ls = []
bow_param_grid = {
# 'local_feature': ['ORB', 'SIFT', 'SURF'],
'local_feature': ['ORB', 'SIFT'],
'vsize': [10, 20, 50, 100, 200, 500],
}
# classifier_names = ['linearSVM', 'rbfSVM', 'LR']
# there is a bug in sklearn where, if linearSVM is used with GridSearchCV,
# the grid search gets stuck
classifier_names = ['rbfSVM', 'LR']
# bow_param_grid = {'local_feature': ['ORB'],
# 'vsize': [100],
# }
bow_param_combs = list(ParameterGrid(bow_param_grid))
for comb in bow_param_combs:
bow = BoW(**comb)
trainvalXf = bow.fit_transform(trainval_paths)
testXf = bow.transform(test_paths)
for cls_name in classifier_names:
classifier = GridSearchCV(get_classifier(cls_name),
{'C': np.logspace(-4, 4, 20)},
scoring='accuracy',
n_jobs=N_JOBS,
verbose=True)
# classifier = get_classifier(cls_name)
classifier.fit(trainvalXf, trainvalY)
score = classifier.best_score_
record = OrderedDict()
record['local_feature'] = comb['local_feature']
record['vsize'] = comb['vsize']
record['classifier'] = cls_name
record['valid_acc'] = score
record_ls.append(record)
probs = classifier.predict_proba(testXf)
# print(probs.shape)
sub.loc[:, 'healthy':] = probs
sub.to_csv(os.path.join(submission_dir,
'BoW-{}-vsize{:d}-{}.csv'\
.format(comb['local_feature'],
comb['vsize'],
cls_name
)
),
index=False)
print('({}, {}, {}): {:.4f}'.format(comb['local_feature'],
comb['vsize'],
cls_name,
score))
# del classifier
# gc.collect()
report_df = pd.DataFrame(record_ls)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
display(report_df)
report_df.to_csv(os.path.join('../output', f'BoW_feature_report.csv'), index=False)
The best score is 0.657330, which is not satisfactory. This may be because the representational capacity of BoW features is insufficient for this task.
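One refinement that was not tried above: the raw BoW count histograms scale with the number of keypoints detected in each image, so images with many keypoints dominate the feature space. L2-normalizing each histogram (or applying TF-IDF weighting) is a common fix. A minimal sketch:

```python
import numpy as np

# Two toy BoW histograms with identical word proportions but different
# keypoint counts; after L2 normalization they become identical vectors.
hists = np.array([[4., 0., 6.],    # image with many keypoints
                  [2., 0., 3.]])   # same proportions, fewer keypoints
norms = np.linalg.norm(hists, axis=1, keepdims=True)
hists_l2 = hists / np.clip(norms, 1e-12, None)
print(hists_l2)
```

Dropping this normalization step into BoW.get_bow_feature_vec would make the feature invariant to keypoint count.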
# a strange bug: when I use N_JOBS=-1, the fitting stucks
class HybridClassfier(object):
def __init__(self, cnn='VGG16', classifier='linearSVM', dim_reductor='PCA_128', cache_dir='feature_cache'):
# self.cache_dir = cachedir
# os.makedir(self.cache_dir, exist_ok=True)
backbone = get_backbone(cnn)
self.feature = tf.keras.Sequential([
backbone,
L.GlobalMaxPooling2D()
])
if dim_reductor != 'none':
dim_red = get_dim_reductor(dim_reductor)
classifier = get_classifier(classifier)
pipe = Pipeline(steps=[('dimred', dim_red),
# ('normalizer', Normalizer()),
('cls', classifier)])
param_grid = {'cls__C': np.logspace(-3, 3, 13)}
self.classifier = GridSearchCV(pipe,
param_grid,
scoring='accuracy',
n_jobs=N_JOBS,
verbose=True)
else:
classifier = get_classifier(classifier)
pipe = Pipeline(steps=[('scaler', StandardScaler()),
('cls', classifier)])
param_grid = {'cls__C': np.logspace(-3,3,13)}
self.classifier = GridSearchCV(pipe,
param_grid,
scoring='accuracy',
n_jobs=N_JOBS,
verbose=True)
def fit(self, train_data):
Xf = []
Y = []
print('extracting feature...')
for image_batch, label_batch in train_data:
# print(self.feature(image_batch).shape)
Xf.append(self.feature(image_batch))
Y.append(label_batch.numpy())
# print(len(Xf))
Xf = np.concatenate(Xf, 0)
Y = np.argmax(np.concatenate(Y, 0), 1)
# print(Xf.shape)
# print(Xf[:5])
# print(Y[:5])
with parallel_backend('loky'):
self.classifier.fit(Xf, Y)
return self.classifier.best_score_
def predict(self, test_data):
Xf = []
for image_batch in test_data:
Xf.append(self.feature(image_batch))
Xf = np.concatenate(Xf, 0)
return self.classifier.predict_proba(Xf)
def close(self):
# release memory
del self.feature
K.clear_session()
gc.collect()
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.close()
# print('closed successfully')
# def __del__(self):
# self.close()
record_ls = []
comb_grid = {'cnn': ['VGG16', 'InceptionResNetV2', 'MobileNetV2'],
'classifier': ['LR', 'linearSVM', 'rbfSVM'],
# 'classifier': ['LR'],
'dim_reductor': ['none', 'NMF_32', 'NMF_64', 'NMF_128',
'KPCA_32', 'KPCA_64', 'KPCA_128']}
param_combs = list(ParameterGrid(comb_grid))
for comb in param_combs:
with HybridClassfier(**comb) as classifier:
print('current combination: ', comb)
record = OrderedDict()
# classifier = HybridClassfier(**comb)
score = classifier.fit(trainval_dataset)
record['cnn'] = comb['cnn']
record['dim_reductor'] = comb['dim_reductor']
record['classifier'] = comb['classifier']
record['valid_acc'] = score
record_ls.append(record)
print('current score: ', score)
probs = classifier.predict(test_dataset)
sub.loc[:, 'healthy':] = probs
sub.to_csv(os.path.join(submission_dir,
'{}-{}-{}.csv'\
.format(comb['cnn'], comb['dim_reductor'], comb['classifier'])),
index=False)
report_df = pd.DataFrame(record_ls)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
display(report_df)
report_df.to_csv(os.path.join('../output', f'cnn_feature_report.csv'), index=False)
Using deep features, we reach a validation accuracy of up to 0.862716, better than what we got with BoW features. To further improve performance, we can finetune the whole CNN so that the features adapt to this task.
CNNs to be finetuned: ResNet101V2, VGG16, InceptionResNetV2, MobileNetV2.
The first three models are heavy architectures, while the last, MobileNetV2, is lightweight. All available pretrained CNNs are listed under Module: tf.keras.applications
LR_START = 0.0001
LR_MAX = 0.00005 * 8
LR_MIN = 0.0001
LR_RAMPUP_EPOCHS = 4
LR_SUSTAIN_EPOCHS = 6
LR_EXP_DECAY = .8
def lrfn(epoch):
if epoch < LR_RAMPUP_EPOCHS:
lr = (LR_MAX - LR_START) / LR_RAMPUP_EPOCHS * epoch + LR_START
elif epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
lr = LR_MAX
else:
lr = (LR_MAX - LR_MIN) * LR_EXP_DECAY**(epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS) + LR_MIN
return lr
lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=True)
rng = [i for i in range(EPOCHS)]
y = [lrfn(x) for x in rng]
plt.plot(rng, y)
print("Learning rate schedule: {:.3g} to {:.3g} to {:.3g}".format(y[0], max(y), y[-1]))
record_ls = []
for cnn in ['ResNet101V2', 'VGG16', 'InceptionResNetV2', 'MobileNetV2']:
    print(f'Finetune {cnn}...')
record = OrderedDict()
with strategy.scope():
# build model
backbone = get_backbone(cnn)
model = tf.keras.Sequential([
backbone,
L.GlobalMaxPooling2D(),
# L.Dropout(0.3),
L.Dense(4, activation='softmax')
])
model.compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics=['categorical_accuracy']
)
model.summary()
ckpt_path = os.path.join(ckpt_dir,
f'finetuned_{cnn}.h5')
checkpoint = tf.keras.callbacks.ModelCheckpoint(
ckpt_path,
verbose=1,
monitor='val_categorical_accuracy',
save_best_only=True,
mode='auto')
STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE
history = model.fit(
train_dataset,
epochs=EPOCHS,
callbacks=[lr_callback, checkpoint],
steps_per_epoch=STEPS_PER_EPOCH,
validation_data=valid_dataset
)
# display training curves
display_training_curves(
history.history['loss'],
history.history['val_loss'],
'loss', 211)
display_training_curves(
history.history['categorical_accuracy'],
history.history['val_categorical_accuracy'],
'accuracy', 212)
plt.show()
record['model'] = f'finetuned_{cnn}'
best_idx = np.argmax(history.history['val_categorical_accuracy'])
record['train_loss'] = history.history['loss'][best_idx]
record['valid_loss'] = history.history['val_loss'][best_idx]
record['train_acc'] = history.history['categorical_accuracy'][best_idx]
record['valid_acc'] = history.history['val_categorical_accuracy'][best_idx]
record_ls.append(record)
# run testing with best model weights
model.load_weights(ckpt_path)
print('record: ', record['valid_loss'], record['valid_acc'])
# val_loss, val_acc = model.evaluate(valid_dataset)
# print('confirmation: ', val_loss, val_acc)
print('Start inference on test dataset.')
probs = model.predict(test_dataset, verbose=1)
sub.loc[:, 'healthy':] = probs
sub.to_csv(os.path.join(submission_dir, f'finetune_{cnn}.csv'), index=False)
# sub.head()
# release memory
# https://forums.fast.ai/t/how-could-i-release-gpu-memory-of-keras/2023/19
del model
K.clear_session()
gc.collect()
report_df = pd.DataFrame(record_ls)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
display(report_df)
report_df.to_csv(os.path.join('../output', 'finetune_cnn_report.csv'), index=False)
A compelling result is obtained by finetuning CNNs end-to-end. Can we do better by preprocessing the input images?
The background may distract disease detection and classification, so let's test whether background removal helps. Following victorlouisdg's kernel, I use the GrabCut algorithm to do this job.
from mpl_toolkits.axes_grid1 import ImageGrid
bg_removed_img_dir = '../input/images_no_bg'
os.makedirs(bg_removed_img_dir, exist_ok=True)
def init_grabcut_mask(h, w):
mask = np.ones((h, w), np.uint8) * cv2.GC_PR_BGD
mask[h//4:3*h//4, w//4:3*w//4] = cv2.GC_PR_FGD
mask[2*h//5:3*h//5, 2*w//5:3*w//5] = cv2.GC_FGD
return mask
plt.imshow(init_grabcut_mask(3*136, 3*205))
def remove_background(image):
h, w = image.shape[:2]
mask = init_grabcut_mask(h, w)
bgm = np.zeros((1, 65), np.float64)
fgm = np.zeros((1, 65), np.float64)
cv2.grabCut(image, mask, None, bgm, fgm, 1, cv2.GC_INIT_WITH_MASK)
mask_binary = np.where((mask == 2) | (mask == 0), 0, 1).astype('uint8')
result = cv2.bitwise_and(image, image, mask = mask_binary)
# add_contours(result, mask_binary) # optional, adds visualizations
return result
# visualize samples
num_show = 5
rows, cols = (num_show, 2)
axes_pad = 0.2
fig_h = 4.0 * rows + axes_pad * (rows-1)
fig_w = 4.0 * cols + axes_pad * (cols-1)
fig = plt.figure(figsize=(fig_w, fig_h))
grid = ImageGrid(fig, 111, nrows_ncols=(rows, cols), axes_pad=0.2)
for i, ax in enumerate(grid):
img_path = trainval_paths[i//2]
img = cv2.resize(cv2.imread(img_path), (IMAGE_SIZE, IMAGE_SIZE))
if i % 2 == 1:
img = remove_background(img)
ax.imshow(img[:, :, ::-1])
for img_path in tqdm(trainval_paths):
img = cv2.resize(cv2.imread(img_path), (IMAGE_SIZE, IMAGE_SIZE))
nobg = remove_background(img)
cv2.imwrite(os.path.join(bg_removed_img_dir,
os.path.basename(img_path)),
nobg)
for img_path in tqdm(test_paths):
img = cv2.resize(cv2.imread(img_path), (IMAGE_SIZE, IMAGE_SIZE))
nobg = remove_background(img)
cv2.imwrite(os.path.join(bg_removed_img_dir,
os.path.basename(img_path)),
nobg)
def format_path_nobg(st):
return os.path.join(bg_removed_img_dir, st + '.jpg')
test_paths_new = test_data.image_id.apply(format_path_nobg).values
trainval_paths_new = train_data.image_id.apply(format_path_nobg).values
trainval_labels_new = np.float32(train_data.loc[:, 'healthy':'scab'].values)
train_paths_new, valid_paths_new, train_labels_new, valid_labels_new =\
train_test_split(trainval_paths_new,
trainval_labels_new,
test_size=VALIDATION_SIZE,
random_state=SEED)
print('train samples: ', len(train_paths_new))
print('valid samples: ', len(valid_paths_new))
print('test samples: ', len(test_paths_new))
print('path example: ', train_paths_new[0])
print('label example: ', train_labels_new[0])
train_dataset_new = (
tf.data.Dataset
.from_tensor_slices((train_paths_new, train_labels_new))
.map(decode_image, num_parallel_calls=AUTO)
.cache()
.map(data_augment, num_parallel_calls=AUTO)
.repeat()
.shuffle(512)
.batch(BATCH_SIZE)
.prefetch(AUTO)
)
valid_dataset_new = (
tf.data.Dataset
.from_tensor_slices((valid_paths_new, valid_labels_new))
.map(decode_image, num_parallel_calls=AUTO)
.batch(BATCH_SIZE)
.cache()
.prefetch(AUTO)
)
test_dataset_new = (
tf.data.Dataset
.from_tensor_slices(test_paths_new)
.map(decode_image, num_parallel_calls=AUTO)
.map(data_augment, num_parallel_calls=AUTO)
.batch(BATCH_SIZE)
)
record_ls = []
for cnn in ['ResNet101V2', 'VGG16', 'InceptionResNetV2', 'MobileNetV2']:
    print(f'Finetune {cnn}...')
record = OrderedDict()
with strategy.scope():
# build model
backbone = get_backbone(cnn)
model = tf.keras.Sequential([
backbone,
L.GlobalMaxPooling2D(),
# L.Dropout(0.3),
L.Dense(4, activation='softmax')
])
model.compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics=['categorical_accuracy']
)
model.summary()
ckpt_path = os.path.join(ckpt_dir,
f'finetuned_{cnn}_nobg.h5')
checkpoint = tf.keras.callbacks.ModelCheckpoint(
ckpt_path,
verbose=1,
monitor='val_categorical_accuracy',
save_best_only=True,
mode='auto')
STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE
history = model.fit(
train_dataset_new,
epochs=EPOCHS,
callbacks=[lr_callback, checkpoint],
steps_per_epoch=STEPS_PER_EPOCH,
validation_data=valid_dataset_new
)
# display training curves
display_training_curves(
history.history['loss'],
history.history['val_loss'],
'loss', 211)
display_training_curves(
history.history['categorical_accuracy'],
history.history['val_categorical_accuracy'],
'accuracy', 212)
plt.show()
record['model'] = f'finetuned_{cnn}_nobg'
best_idx = np.argmax(history.history['val_categorical_accuracy'])
record['train_loss'] = history.history['loss'][best_idx]
record['valid_loss'] = history.history['val_loss'][best_idx]
record['train_acc'] = history.history['categorical_accuracy'][best_idx]
record['valid_acc'] = history.history['val_categorical_accuracy'][best_idx]
record_ls.append(record)
# run testing with best model weights
model.load_weights(ckpt_path)
print('record: ', record['valid_loss'], record['valid_acc'])
# val_loss, val_acc = model.evaluate(valid_dataset)
# print('confirmation: ', val_loss, val_acc)
print('Start inference on test dataset.')
probs = model.predict(test_dataset_new, verbose=1)
sub.loc[:, 'healthy':] = probs
sub.to_csv(os.path.join(submission_dir,
f'finetune_{cnn}_nobg.csv'),
index=False)
# sub.head()
# release memory
# https://forums.fast.ai/t/how-could-i-release-gpu-memory-of-keras/2023/19
del model
K.clear_session()
gc.collect()
report_df = pd.DataFrame(record_ls)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
display(report_df)
report_df.to_csv(os.path.join('../output',
                             'finetune_cnn_nobg_report.csv'),
index=False)
For all models, the accuracy decreases slightly compared with the no-background-removal scenario. However, the overfitting problem seems to be eased for InceptionResNetV2, since its training accuracy and validation accuracy are very close.
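This overfitting comparison can be made concrete by computing the train/validation gap at the best validation epoch from the Keras history dict. A sketch with made-up accuracy values (the real ones come from model.fit above):

```python
import numpy as np

# Keras-style history dict with illustrative (made-up) accuracy curves.
history = {
    'categorical_accuracy':     [0.80, 0.92, 0.97, 0.99],
    'val_categorical_accuracy': [0.78, 0.90, 0.95, 0.94],
}
# Pick the epoch with the best validation accuracy, as the checkpoint does,
# then measure the generalization gap there.
best = int(np.argmax(history['val_categorical_accuracy']))
gap = history['categorical_accuracy'][best] - history['val_categorical_accuracy'][best]
print(f'best epoch: {best}, train/valid gap: {gap:.3f}')
```

A small gap (as observed for InceptionResNetV2 after background removal) suggests the model is not memorizing the training set.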
Ensembling involves averaging multiple prediction vectors to reduce errors and improve accuracy. Now, I will ensemble the predictions of the best finetuned CNNs (MobileNetV2, ResNet101V2, and InceptionResNetV2) to (hopefully) produce better results.
ensemble_subs = ['finetune_MobileNetV2.csv',
'finetune_ResNet101V2.csv',
'finetune_InceptionResNetV2.csv']
sub = pd.read_csv(SUB_PATH)
final = 0
cnt = len(ensemble_subs)
for sub_name in ensemble_subs:
df = pd.read_csv(os.path.join(submission_dir, sub_name))
prob = df.loc[:, 'healthy':].to_numpy()
# display(df)
# print(prob)
final += prob
final = final / cnt
sub.loc[:, 'healthy':] = final
sub.to_csv(os.path.join(submission_dir, 'ensemble.csv'),
index=False)
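A possible variation on the plain average above is to weight each model's probabilities by its validation accuracy. A sketch with made-up probabilities and accuracies standing in for the per-model submission files:

```python
import numpy as np

# One sample's class probabilities from three hypothetical models
# (placeholders for the rows of the submission CSVs read above).
probs = [np.array([[0.7, 0.1, 0.1, 0.1]]),
         np.array([[0.5, 0.3, 0.1, 0.1]]),
         np.array([[0.9, 0.05, 0.03, 0.02]])]
val_accs = np.array([0.95, 0.96, 0.97])  # assumed validation accuracies

# Normalize the accuracies into weights, then take the weighted average;
# since each row sums to 1 and the weights sum to 1, the result still does.
weights = val_accs / val_accs.sum()
final = sum(w * p for w, p in zip(weights, probs))
print(final.round(4))
```

With nearly equal validation accuracies, this is close to the plain average; it only diverges when one model clearly outperforms the others.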